About This Project
Purpose: This demonstration project shows how messy hospital data can be transformed into FHIR-compliant resources with comprehensive quality assurance.
Approach: Uses official German MII (Medizininformatik-Initiative) test data to demonstrate understanding of the German medical informatics ecosystem and interoperability standards.
Project Overview
This project demonstrates a complete ETL pipeline that transforms messy hospital data into HL7 FHIR R4 compliant resources, using official German MII test data from the Medizininformatik-Initiative.
"Reverse Engineering" Approach for Validation
Started with clean MII FHIR → Created messy CSV with realistic quality issues → Transformed back to clean FHIR → Validated against original gold standard
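The round-trip validation idea can be sketched in a few lines. This is an illustrative snippet, not the repository's actual code: the function name `diff_resources` and the field list are assumptions, and the two dicts stand in for a gold-standard MII resource and its regenerated counterpart.

```python
# Hypothetical sketch of the round-trip check: compare a regenerated FHIR
# resource against the original gold standard on the reproducible fields.
GOLD = {"resourceType": "Patient", "id": "00001",
        "gender": "male", "birthDate": "1950-01-01"}
REGENERATED = {"resourceType": "Patient", "id": "00001",
               "gender": "male", "birthDate": "1950-01-01"}

def diff_resources(gold: dict, candidate: dict,
                   fields=("id", "gender", "birthDate")) -> dict:
    """Return {field: (gold_value, candidate_value)} for every mismatch."""
    return {f: (gold.get(f), candidate.get(f))
            for f in fields if gold.get(f) != candidate.get(f)}

mismatches = diff_resources(GOLD, REGENERATED)
assert not mismatches  # empty dict means the round trip succeeded
```

An empty diff means the messy-CSV detour reproduced the original resource exactly; any non-empty entry pinpoints which field the pipeline failed to recover.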
Data Quality Issues Handled
Transformation Results
Success Rates by Resource Type
Resource Counts
Transformation Viewer
See exactly how the pipeline transforms messy CSV input into clean FHIR R4 resources. Each example uses real data from the pipeline.
- ID normalized: `WP1-00001` → `00001` (stripped prefix)
- Date converted: `01.01.1950` (DD.MM.YYYY) → `1950-01-01` (ISO 8601)
- Gender standardized: `Male` → `male` (FHIR ValueSet)
- Missing fields (FirstName, Street, PostalCode) gracefully omitted from FHIR output
- Date converted: `1953/01/01` (YYYY/MM/DD) → `1953-01-01` (ISO 8601)
- Gender mapped: `W` (German abbreviation for *weiblich*) → `female` (FHIR ValueSet)
- German character `ß` preserved correctly in UTF-8 address
- Gender mapped: `weiblich` (German for female) → `female` (FHIR ValueSet)
- Missing LastName, Birthdate, Street gracefully omitted from the FHIR resource
- This patient demonstrates maximum missing-data handling (4 of 8 fields empty)
- Missing CodeSystem inferred from ICD-10 code format → `http://fhir.de/CodeSystem/bfarm/icd-10-gm`
- Patient reference normalized: `PAT00002` → `Patient/00002`
- Clinical status defaulted to `active` (not present in CSV)
- Date converted: `2019/01/05` → ISO 8601 format
- Patient reference normalized: `WP1-00001` → `Patient/00001`
- ATC code `N06AA09` linked to a Medication resource reference
- Missing Status defaulted to `completed`
- Date converted: `01-01-2019` (DD-MM-YYYY) → `2019-01-01` (ISO 8601)
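The normalizations above boil down to three small helpers. The sketch below, assuming Python 3.9+, uses illustrative names (`normalize_id`, `normalize_date`, `normalize_gender`) that are not necessarily the repository's actual functions; the prefix list, date formats, and gender map are taken from the examples in this document.

```python
import re
from datetime import datetime

# Value mappings taken from the examples above; names are illustrative.
GENDER_MAP = {"m": "male", "male": "male", "männlich": "male",
              "w": "female", "f": "female", "female": "female",
              "weiblich": "female"}

DATE_FORMATS = ("%d.%m.%Y", "%Y/%m/%d", "%d-%m-%Y", "%Y-%m-%d")

def normalize_id(raw: str) -> str:
    """Strip known prefixes (P-, PAT, WP1-) and keep the numeric part."""
    return re.sub(r"^(P-|PAT|WP1-)", "", raw.strip())

def normalize_date(raw: str) -> str:
    """Try each known source format and emit ISO 8601 (YYYY-MM-DD)."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {raw!r}")

def normalize_gender(raw: str) -> str:
    """Map German/English variants onto FHIR administrative-gender codes."""
    return GENDER_MAP[raw.strip().lower()]

print(normalize_id("WP1-00001"))      # 00001
print(normalize_date("01.01.1950"))   # 1950-01-01
print(normalize_gender("weiblich"))   # female
```

Note that the fixed order of `DATE_FORMATS` is itself a policy decision: a value like `01-05-2019` parses under `%d-%m-%Y` but would also parse as MM-DD-YYYY, which is exactly the ambiguity discussed in the quality notes below.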
Data Quality Metrics
Validation Against Original FHIR
Quality Scores
- Gender values are standardized to the FHIR administrative-gender codes (male/female).
- Ambiguous dates such as 01-05-2019 (Jan 5 or May 1?) cannot be resolved without context. This is a realistic real-world ETL problem.
- Where a Condition arrived with an ICD-10 code but no CodeSystem (e.g. M80.01), the pipeline inferred http://fhir.de/CodeSystem/bfarm/icd-10-gm from the code format.
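The CodeSystem inference mentioned above can be sketched as simple pattern matching on the code's shape. This is an assumption about how such inference might work, not the repository's implementation; the ICD-10-GM canonical URL comes from this document, while the ATC URL and both regexes are my own illustrative additions.

```python
import re

# Illustrative patterns: ICD-10 codes look like "M80.01", ATC codes like
# "N06AA09". The regexes and the ATC URL are assumptions for this sketch.
ICD10_PATTERN = re.compile(r"^[A-Z]\d{2}(\.\d{1,2})?$")
ATC_PATTERN = re.compile(r"^[A-Z]\d{2}[A-Z]{2}\d{2}$")

def infer_code_system(code: str):
    """Guess the CodeSystem URL from the code format; None if unrecognized."""
    if ICD10_PATTERN.match(code):
        return "http://fhir.de/CodeSystem/bfarm/icd-10-gm"
    if ATC_PATTERN.match(code):
        return "http://fhir.de/CodeSystem/bfarm/atc"
    return None

print(infer_code_system("M80.01"))   # ICD-10-GM system URL
print(infer_code_system("N06AA09"))  # ATC system URL
```

Format-based inference is a heuristic: it recovers the common cases but should be logged as an assumption in the provenance trail rather than treated as ground truth.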
Data Quality Issues Handled
| Issue Type | Example Input (Messy) | Output (Clean FHIR) | Status |
|---|---|---|---|
| ID Format Variations | `P-00001`, `PAT00002`, `00003`, `WP1-00004` | `00001`, `00002`, `00003`, `00004` | Normalized |
| Date Format Variations | `01.01.1950`, `1950/01/01`, `01-01-1950` | `1950-01-01` (ISO 8601) | Normalized |
| Gender Inconsistencies | `M`, `m`, `male`, `männlich`, `W`, `weiblich` | `male`, `female` (FHIR ValueSet) | Normalized |
| Missing Data | ~12% random missing values | Handled with defaults or omitted | Partial |
| German Characters | Müller, Schröder, Weiß | UTF-8 preserved | Preserved |
Requirements Alignment
How this project and my broader experience map to core data engineering requirements.
Spark & NiFi
Database know-how
FHIR, LOINC, OMOP
Python, SQL, R, Java
Kubernetes, Spark, Ansible
Quality assurance
Living documentation
Medical context
- Project = demonstrated in this repository
- Experience = from professional background
From Demo to Production Architecture
This demo uses a single-threaded Python pipeline. Below is how the same logic would be architected for production-scale medical data integration at a university hospital.
Demo pipeline:
- 1 core, ~2 GB RAM
- pandas + fhir.resources
- Suitable for demonstration, prototyping, and validation of transformation logic

Production pipeline:
- Data provenance tracking
- 4–8 nodes, 200K+ patients
- REST API + search
- Research queries
| Aspect | Demo Pipeline | Production Pipeline |
|---|---|---|
| Orchestration | Manual CLI execution | NiFi flow-based scheduling & monitoring |
| Processing | Single-threaded Python / pandas | Spark distributed across cluster nodes |
| Scale | 200 patients (~30 s) | 200K+ patients, horizontal scaling |
| Storage | Local JSON files | HAPI FHIR Server + Parquet data lake |
| Fault Tolerance | Script re-run on failure | Automatic retry, checkpointing, dead-letter queues |
| Monitoring | Console output + log files | NiFi dashboard, Spark UI, alerting |
| Research Export | JSON files for validation | OMOP CDM + FHIR Bulk Export for multi-site studies |
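One practical bridge between the two columns of this table is to keep every transformation a pure per-row function with no I/O and no shared state: the same function then runs under pandas locally and as a Spark map in production. The sketch below is an assumption about how that migration could look (function name and column names are illustrative; the Spark wiring appears only in comments so the snippet stays runnable without a cluster).

```python
def clean_row(row: dict) -> dict:
    """Pure per-row transform: deterministic and side-effect free,
    so it can be distributed safely. Column names are assumptions."""
    return {
        "id": (row["PatientID"]
               .removeprefix("WP1-")
               .removeprefix("PAT")
               .removeprefix("P-")),
        "gender": {"m": "male", "w": "female"}.get(
            row["Gender"].strip().lower()[:1]),
    }

# Demo path (pandas, single node):
#   df.apply(lambda r: clean_row(r.to_dict()), axis=1)
# Production path (PySpark, same function, distributed):
#   spark_df.rdd.map(lambda r: clean_row(r.asDict())).toDF()
print(clean_row({"PatientID": "WP1-00001", "Gender": "Male"}))
```

Because the function never touches files or globals, Spark can partition the 200K+ patients across nodes and retry failed partitions without risking duplicate side effects, which is what makes the fault-tolerance row of the table achievable.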
Based on industry best practices for FHIR data integration at German university hospitals and the MII consortium architecture.
Technical Implementation
Technologies Used
Key Features
- Data quality validation (completeness, consistency)
- Referential integrity checks
- German medical terminology (ICD-10-GM, ATC)
- MII Kerndatensatz profile compliance
- Comprehensive error handling
- Validation against source data
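The referential-integrity check in the feature list can be illustrated with a set lookup: every dependent resource (Condition, MedicationStatement) must point at a Patient that actually exists in the batch. The data shapes and function name below are assumptions for this sketch, not the repository's API.

```python
def check_references(patients: list[dict], dependents: list[dict]) -> list[str]:
    """Return IDs of dependent resources whose subject reference
    does not resolve to a Patient in this batch."""
    known = {f"Patient/{p['id']}" for p in patients}
    return [d["id"] for d in dependents
            if d["subject"]["reference"] not in known]

patients = [{"id": "00001"}, {"id": "00002"}]
conditions = [
    {"id": "c1", "subject": {"reference": "Patient/00001"}},
    {"id": "c2", "subject": {"reference": "Patient/99999"}},  # broken link
]
print(check_references(patients, conditions))  # ['c2']
```

Running this after normalization (so `PAT00002` has already become `Patient/00002`) is what lets prefix-mangled IDs resolve instead of being flagged as orphans.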
How to Run This Project
Quick Start
Requirements: Python 3.9+, ~2 GB disk space for test data
Execution time: ~2-3 minutes for full pipeline (200 patients)